/*
HEADER: CUG nnn.nn;
TITLE: LEX - A Lexical Analyser Generator
VERSION: 1.0 for IBM-PC
DATE: Sep 29, 195
DESCRIPTION: A Lexical Analyser Generator. From UNIX
KEYWORDS: Lexical Analyser Generator YACC C PREP
SYSTEM:         IBM-PC and Compatibles
FILENAME: LEX.MEM
WARNINGS: This program is not for the casual user. It will
be useful primarily to expert developers.
CRC: N/A
SEE-ALSO: YACC and PREP
AUTHORS:        Scott Guthery, 11100 Leafwood Lane, Austin, TX 78750
COMPILERS: DESMET-C
REFERENCES: UNIX Systems Manuals
*/
DECUS C LANGUAGE SYSTEM
LEX
A Lexical Analyser Generator
by
Charles H. Forsyth
University of Waterloo
Waterloo, Ontario, N2L 3G1
Canada
Revised by
Robert B. Denny & Martin Minow
LEX transforms a regular-expression grammar and associated
action routines into a C function and set of tables, yielding a
table-driven lexical analyser which manages to be compact and
rapid.
DECUS Structured Languages SIG
Version of 30-Oct-82
Copyright (C) 1978, Charles H. Forsyth
Modifications Copyright (C) 1980, 1982, DECUS
General permission to copy or modify, but not for
profit, is hereby granted, provided that the above
copyright notice is included and reference made to the
fact that reproduction privileges were granted by DECUS.
The information in this document is subject to change
without notice and should not be construed as a
commitment by Digital Equipment Corporation or by DECUS.
Neither Digital Equipment Corporation, DECUS, nor the
authors assume any responsibility for the use or
reliability of this document or the described software.
This software is made available without any support
whatsoever. The person responsible for an
implementation of this system should expect to have to
understand and modify the source code if any problems
are encountered in implementing or maintaining the
compiler or its run-time library. The DECUS 'Structured
Languages Special Interest Group' is the primary focus
for communication among users of this software.
UNIX is a trademark of Bell Telephone Laboratories. RSX,
RSTS/E, RT-11 and VMS are trademarks of Digital Equipment
Corporation.
A Lexical Analyser Generator
1.0 Introduction
A computer program often has an input stream which is composed
of small elements, such as a stream of characters, and which it
would like to convert to larger elements in order to process the
data conveniently. A compiler is a common example of such a
program: it reads a stream of characters forming a program, and
would like to turn this sequence of characters into a sequence
of larger items, namely identifiers, numbers, and operators, for
parsing. In a compiler, the procedures which do this are
collectively called the lexical analyser, or scanner; this
terminology will be used more generally here.
It may happen that the speed with which this transformation is
done will noticeably affect the speed at which the rest of the
program operates. It is certainly true that although such code
is rarely difficult to write, writing and debugging it is both
tedious and time-consuming, and one would typically rather
spend the time on the hard parts of the program. It is also
true that while certain transformations are easily thought of,
they are often hard to express succinctly in the usual
general-purpose programming languages (eg, the description of a
floating-point number).
LEX is a program which tries to give a programmer a good deal of
help in this, by writing large parts of the lexical analyser
automatically, based on a description supplied by the programmer
of the items to be recognised (which are known as tokens), as
patterns of the more basic elements of the input stream. The
LEX description is very much a special-purpose language for
writing lexical analysers, and LEX is then simply a translator
for this language. The LEX language is easy to write, and the
resulting processor is compact and fast running.
The purpose of a LEX program is to read an input stream, and
recognise tokens. As the lexical analyser will usually exist as
a subroutine in a larger set of programs, it will return a "token
number", which indicates which token was found, and possibly
a "token value", which provides more detailed information about
possibility; a LEX program is often a good description of the
structure of the whole computation, and in such a case, the
lexical analyser might choose to call other routines to perform
the necessary actions whenever a particular token is recognised,
without reference to its own caller.
2.0 The Lex Language
LEX transforms a regular-expression grammar into a deterministic
finite-state automaton that recognizes that grammar. Each rule
of the grammar is associated with an action which is to be
performed when that rule successfully matches some part of the
input data.
Because of the nature of regular expression grammars, certain
language constructions cannot be recognized by LEX programs.
Specifically, expressions with balanced parentheses cannot be
recognized. This means that LEX cannot be used to recognize all
Fortran keywords, as some (DO, IF, and FORMAT, for example)
require elaborate recognition to distinguish between ambiguous
constructions.
2.1 Elementary Things
Strings, characters, sets of characters called character
classes, and operators to form these into patterns, are the
fundamental elements of the LEX language.
A string is a sequence of characters, not including newline,
enclosed in quotes or apostrophes. Within a string, the
following escape sequences
(which are those of the C language) allow any 8-bit character to
be represented, including the escape character, quotes, and
newlines:
\n NL (012)
\r CR (015)
\b BS (010)
\t TAB (011)
\" "
\' '
\c c
\\ \
\NNN (NNN)
where NNN is a number in octal, and c is any printable
character. A string may be continued across a line by writing
the escape character before the newline.
Outside a string, a sequence of upper-case letters stands for a
sequence of the equivalent lower-case letters, while a sequence
of lower-case letters is taken as the name of a LEX expression,
and handled specially, as described below. These conventions
make the use of named expressions and the description of
lower-case keywords (the usual case on Unix) fairly convenient.
Keywords in case-independent languages, such as Fortran, require
additional effort to match, as will be noted.
An Ascii character other than one of
() {} [ * | = ; % / \ ' " -
may be used in LEX to stand for itself.
A sequence of characters enclosed by brackets ('[' and ']')
forms a character class, which stands for all those characters
within the brackets. If a circumflex ('^') follows the opening
bracket, then the class instead stands for all characters
except those inside the brackets. The escapes used in
strings may be used in character classes as well.
Within a character class, the construction "A-B" (where "A" and
"B" are arbitrary characters) stands for the range of characters
between "A" and "B" inclusive.
For example,
"ABC" matches "abc"
"[ABC]" matches "A" or "B" or "C"
"[A-Za-z0-9]" matches all letters and digits
Case-independent keyword recognition may be described by using
auxiliary definitions to define expressions that match either
case. For example,
a = [Aa];
b = [Bb];
...
z = [Zz];
%%
d o Matches "DO", "do", "Do", or "dO"
2.2 Putting Things Together
Several operators are provided to allow construction of a type
of pattern called a regular expression. Such expressions can be
implemented as finite-state automata (without memory or stacks).
A reference to an "occurrence" of a regular expression is
generally taken to mean an occurrence of any string matched by
that regular expression. The operators are presented in order
of decreasing priority. In all cases, operators work on either
strings or character classes, or on other regular expressions.
Any string or character class forms a regular expression which
matches whatever the string or character class stands for (as
described above).
The operator '*', written following a regular expression, forms
a new regular expression which matches an arbitrary number (ie,
zero or more) of adjacent occurrences of the first regular
expression. The operation is often referred to as (Kleene)
closure.
The operation of concatenation of two regular expressions is
expressed simply by writing the regular expressions adjacent to
each other. The resulting regular expression matches any
occurrence of the first regular expression followed directly by
an occurrence of the second regular expression.
The operator '|', alternation, written between two regular
expressions forms a regular expression which matches an
occurrence of the first regular expression or an occurrence of
the second regular expression.
Any regular expression may be enclosed in parentheses to cause
the priority of operators to be overridden in the usual manner.
A few examples should help to make all of this clear:
"[0-9]*" matches a (possibly empty) sequence of
digits.
"[A-Za-z_$][A-Za-z0-9_$]*"
matches a C identifier.
"([A-Za-z_$]|[0-9])*"
matches a C identifier, or a sequence of
digits, or a sequence of letters and
digits intermixed, or nothing.
2.3 The General Form of Lex Programs
A LEX source input file consists of three sections: a section
containing auxiliary definitions, a section containing
translations, and a section containing programs.
Throughout a LEX program, spaces, tabs, and newlines may be used
freely, and PL/1-style comments:
/* ... anything but '*/' ... */
may be used, and are treated as a space.
The auxiliary definition section must be present, separated from
following sections by the two-character sequence '%%', but may
be empty. This section allows definition of named regular
expressions, which provide the useful ability to use names of
regular expressions in the translation section, in place of
common sub-expressions, or to make that section more readable.
The translation section follows the '%%' sequence, and contains
regular expressions paired with actions which describe what the
lexical analyser should do when it discovers an occurrence of a
given regular expression in its input stream.
The program section may be omitted; if it is present it must be
separated from the translation section by the '%%' sequence. If
present, it may contain anything in general, as it is simply
tacked on to the end of the LEX output file.
The style of this layout will be familiar to users of Yacc. As
LEX is often used with that processor, it seemed reasonable to
keep to a similar format.
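As a minimal sketch of this layout, the following is a complete
LEX source file; the token number T_NUMBER is invented for the
example, and is defined in a declaration section of the kind
described in section 2.6:

        %{
        #define T_NUMBER 257    /* Invented token number (above 256) */
        %}
        digit = [0-9];          /* Auxiliary definition section */
        %%
        digit(digit)*   {       /* Translation section */
                        return(T_NUMBER);
                }
        %%
        main()                  /* Program section, copied verbatim */
        {                       /* to the end of the output file */
                while (yylex())
                        ;
        }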
2.4 Auxiliary Definitions
Given the set of regular expressions forming a complete syntax,
there are often common sub-expressions. LEX allows these to be
named, defined but once, and referred to by name in any
subsequent regular expression. Note that definition must
precede use. A definition has the form:
expression_name = regular_expression ;
where a name is composed of a lower-case letter followed by a
sequence of letters and digits, and where an underscore
is considered a letter. For example,
digit = [0-9];
letter = [a-zA-Z];
name = letter(letter|digit)*;
The semicolon is needed to resolve some ambiguities in the LEX
syntax.
Three auxiliary definitions have special meaning to LEX:
"break", "illegal", and "ignore." They are all defined as
character classes ("break = [,.?]", for example) and are used as
follows:
break An input token will always terminate if a member
of the "break" class is scanned.
illegal The "illegal" class allows simplification of
error detection, as will be described in a later
section. If this class is defined, and the
lexical analyser stops at a character that
"cannot" occur in its present context, the
analyser will output a suitable error message
and ignore the offender.
ignore This class defines a set of characters that are
ignored by the analyser's input routine.
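For instance, a simple word-scanning grammar might define all
three classes as follows; the particular members chosen here are
only for illustration:

        break   = [,.;:?];      /* Punctuation always ends a token */
        ignore  = [\t ];        /* Blanks and tabs are discarded on input */
        illegal = [\0-\377];    /* Anything unmatched draws a diagnostic */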
2.5 Translations
One would like to provide a description of the action to be
taken when a particular sequence of input characters has been
matched by a given regular expression. The kind of action taken
might vary considerably, depending upon the application. In a
compiler, typical actions are: enter an identifier into a symbol
table, read and store a string, or return a particular token to
the parser. In text processing, one might wish to reproduce
most of the input stream on an output stream unchanged, making
substitutions when a particular sequence of characters is found.
In general, it is hard to predict what form the action might
take, and so, in LEX the nature of the action is left to the
user, by allowing specification, for each regular expression of
interest, C-language code to be executed when a string matching
that expression is discovered by the driving program of the
lexical analyser. An action, together with its regular
expression, is called a translation, and has the format:
regular_expression { action }
All of this may be spread across several lines. The action may
be empty, but the braces must appear.
Earlier, it was argued that most general-purpose languages are
inappropriate for writing lexical analysers, and it is important
to see that the subsequent use of such a language to form the
actions is not a contradiction. Most languages are fairly good
at expressing the actions described above (symbol table
manipulation, writing character strings, and such). Leaving
this part of the lexical analyser to those languages therefore
not only makes sense, but also ensures that decisions by the
writer of the lexical analyser generator will not unduly cramp
the user's style. However, general-purpose languages do not as
a rule provide inexpensive pattern matching facilities, or input
description formats, appropriate for describing or structuring a
lexical analyser.
Allowing a user to provide his own code is not really enough, as
he will need some help from LEX to obtain a copy of, or a
pointer to, the current token, if nothing else. LEX provides a
library of C functions which may be called to obtain controlled
access to some of the data structures used by the driving
programs of the lexical analyser. These are described in a
later section.
2.5.1 Numbers and Values -
Typically, a lexical analyser will return a value to its caller
indicating which token has been found. Within an action, this
is done by writing a C return statement, which returns the
appropriate value:
        BEGIN   {
                        return(T_BEGIN);
                }
        name    {
                        lookup(token(NULL));
                        return(T_NAME);
                }
        "/"     {
                        return('/');
                }
Note that function lookup() is provided by the user program.
In many cases, other information must be supplied to its caller
by the scanner. When an identifier is recognised, for example,
both a pointer to a symbol-table entry, and the token number
T_NAME must be returned, yet the C return statement can return
but a single value. Yacc has a similar problem, and so its
lexical analyser sets an external word 'yylval' to the token
value, while the token number is returned by the scanner. LEX
uses the external 'yylval' (to be compatible), but, to make LEX
programs more readable when used alone, the name 'lexval' is set
by a #define statement to 'yylval'. For example,
        name    {
                        lexval = lookup(token(NULL));
                        return(T_NAME);
                }
Certain token numbers are treated specially; these are
automatically defined as manifests (see section 3.2) by LEX, and
all begin with the sequence 'LEX...' so as not to clash with the
user's own names. There are two such tokens defined at present:
LEXSKIP When returned by a user's action routine,
LEXSKIP causes the lexical analyser to ignore
the current token (ie, it does not inform the
parser of its presence), and to look instead for
a new token. This may be used when a comment
sequence has been discovered, and discarded. It
is also useful when the action routine completes
processing of the token. See the discussion of
the comment() library function for an example of
its usage.
LEXERR This is returned by the lexical analyser
(function yylex()) when an unrecognizable input
sequence has been detected. By default, LEXERR
is 256. This is the same as the Yacc error value.
To summarise, the token number is set by the action with a
return statement, and the token value (ie, auxiliary
information) is set by assigning this value to the external
integer 'lexval'.
2.6 Declaration Sections
Declarations in the language of the actions may be included in
both the auxiliary definition section and in the translation
section. In the former case, these declarations will be
external to the lexical analyser, and in the latter case, they
will be local to the lexical analyser (ie, static, or automatic
storage). Declaration sections consist of a sequence of
declarations surrounded by the special bracketing sequences '%{'
and '%}' (as in Yacc). The characters within these brackets are
copied unchanged into the appropriate spots in the lexical
analyser program that LEX writes. The examples in appendix A
suggest how these might be used.
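As a sketch of both kinds of declaration section (the counter
'nname' is invented for this example, and the local 'cp' is
present only to show placement), consider:

        %{
        int nname = 0;          /* External to yylex(); visible */
        %}                      /* to the user's main program */
        letter = [a-zA-Z];
        %%
        %{
        register char *cp;      /* Local to yylex(); scratch */
        %}                      /* storage for the actions below */
        letter(letter)* {
                        cp = token(NULL);       /* First byte of name */
                        nname++;
                        return(LEXSKIP);
                }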
3.0 Using Lex from C
The present version of LEX is intended for use with C; and it
is this usage which will be described here.
3.1 The Function yylex()
The structure of LEX programs is influenced by what Yacc
requires of its lexical analyser.
To begin with, the lexical analyser must be named 'yylex', must
have no parameters, and is expected to return a token number, where
that number is determined by Yacc. The token number for an
Ascii character is its Ascii value (ie, its value as a C
character constant). Named tokens, defined in yacc '%token'
statements, have a number above 256, with the particular number
accessible through a Yacc-produced #define of the given token
name as its number. Yacc also allows 'yylex' to pass a value to
the Yacc action routines, by assigning that value to the
external 'yylval'.
LEX thus provides a lexical analyser function named 'yylex',
which interprets the tables constructed from the LEX program,
returning the token number produced by the actions it performs. Values
assigned to lexval are available in 'yylval', so that use with
Yacc is straightforward.
A value of zero is returned by 'yylex' at end-of-file, and in
the absence of a return statement in an action, a non-zero value
is returned. If computation is performed entirely by the
lexical analyser, then a suitable main program would be
        main()
        {
                while (yylex())
                        ;
        }
3.2 Serial Re-Use of yylex()
The yylex() function contains several variables which are
statically initialized at compile time. Once yylex() sees an
EOF (-1) input character, it will continue to return zero. If
yylex() is to be used inside a loop which processes multiple
files, it must be re-initialized at the beginning of each new
file with a call to the LEX library routine llinit(). For
example (slightly extending the previous example):
        main()
        {
                getfilelist();
                for (file = first; file != last; file = next) {
                        llinit();
                        while (yylex())
                                ;
                }
                printf("All files done\n");
        }
The call to llinit() is unnecessary if yylex() is to process
only one file, or is kept from seeing an EOF input character.
3.3 The Lex Table File
In the absence of instructions to the contrary (see below), LEX
reads a given LEX language file (from the standard input, if an
input file has not been specified) and produces a C program file
'lextab.c' which largely consists of tables which are then
interpreted by 'yylex()' (which is in the LEX library). The
actions supplied by the user in each translation are combined
with a switch statement into a single function, which is called
by the table interpreter when a particular token is found. The
contents of the program section of the LEX file are added at the
end of the output file (lextab.c by default). Normally, LEX
also inserts the lines
#include <stdio.h>
#include <lex.h>
at the top of the file; this causes declarations required by
the standard I/O library and by LEX to be included when the C
program is compiled.
3.4 Analyzers Which Don't Use "Standard I/O"
With the current release, LEX supports the generation of
analyzers which may be incorporated into programs which do not
use the "standard I/O" library. By setting the "-s" switch, as
shown below, the generation of the "#include <stdio.h>" line is
suppressed. All references to standard I/O specific files and
stdio.h have been removed from the LEX library (described in a
later section), with the exception of lexgetc(), lexerror(),
mapch() and lexecho(), which are standard I/O dependent.
The declaration of yylex()'s input file iov pointer "lexin" now
resides in LEXGET.C (lexgetc()). The code which defaults lexin
to stdin has been moved from yylex() to the table file. yylex()
now calls the routine llstin(), which is generated into the
table file. There are no longer any hardwired references to the
variable "lextab", the default table name. Instead, LEX
generates a call to lexswitch() in llstin(), which initializes
yylex() to use the table whose name was given in a "-t" or "-e"
option in LEX's command line. If neither was given, the default
name "lextab" is used. Once the initial table has been set up,
further automatic calls to lexswitch() are suppressed, allowing
the user to manually switch tables as before.
In addition, if the "-s" switch is not given (i.e., normal use
with standard I/O), llstin() defaults lexin to stdin. If "-s"
is given, llstin() is generated to do the lexswitch() mentioned
above only. In any case, yylex() contains no references to the
standard I/O system.
What all of this means is that under normal operation, you won't
notice any change in LEX's characteristics. In addition, you
may use the "-e" ("easy") switch, which will generate a C output
file and LEX tables which (conveniently) have the same name as
the input file, and everything will get set up automagically.
If you specify the "-s" switch, the table file will contain no
references to the standard I/O package, and you may use any of
the lexlib routines except lexgetc(), lexerror(), mapch() or
lexecho().
Don't forget that you must supply your own startup routine
"$$main" if you do not want the standard I/O library. With a
bit of care in this regard, it will be possible to link your
program with the C library without dragging in any I/O modules.
This prevents your having to build another library in order to
access non-I/O library functions. Just make the reference to
the C library the last one given to the linker or taskbuilder so
that only those routines which have not already been found are
pulled from CLIB.
NOTE
Programs that use LEX-generated analyzers and do not use
the standard I/O package must supply their own lexgetc()
and lexerror() routines. Failure to do so will result
in undefined globals.
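As a sketch of what such a program might supply (the in-memory
buffer and the names 'inbuf' and 'inp' are inventions of this
example, and the diagnostic routine is deliberately minimal):

        #define EOF (-1)        /* Normally supplied by stdio.h */

        char inbuf[] = "text to be scanned";
        char *inp = inbuf;

        lexgetc()               /* Feed yylex() from memory, not stdio */
        {
                return(*inp ? *inp++ : EOF);
        }

        lexerror(s)             /* Write the diagnostic somewhere; */
        char *s;                /* this stub simply ignores it */
        {
        }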
3.5 Operating LEX
LEX normally reads the grammar from the standard input, writing
the C program to the file 'lextab.c'. It may be further
controlled by using the following flags upon invocation:
-i filename The grammar is read from 'filename'.
-o filename The analyser is written to 'filename'.
-t tablename The default finite-state automaton is
named lextab (and it is, by default,
written to file 'lextab.c'). The -t
switch causes the internal tables to be
named 'tablename' and, if the -o switch
is not given, written to file
'tablename.c'. This is necessary if the
processor-switching capabilities
described in a later section are to be
used.
-e name "Easy" command line. "-e name" is
equivalent to typing
-i name.LXI -o name.C -t name
Do not include device names or file
extensions on the "easy" command line.
-v [filename] Internal state information is written to
'filename.' If not present, state
information is written to file
'lex.out.'
-d Enable various debugging printouts.
-s Generate analyzer without references to
standard I/O
The command line for compilation of the table file should
contain no surprises:
cc -c -O lextab.c (on Unix)
xcc lextab -a (on DEC operating systems)
but when one is producing the running program, one must be
careful to include the necessary libraries. On Unix, the proper
sequence is:
cc userprog.o lextab.o -ll -lS
The '-ll' causes the LEX library (described below) to be
searched, and the '-lS' causes the Standard I/O library to be
used; both libraries are required. If Yacc is used as well,
the library '-ly' should be included before the '-ll'. The
actual order and content of the rest of the command line is
determined by the user's own requirements.
If using the Decus C compiler, the lexical analyser built by LEX
is linked with c:lexlib.
The complete process (assuming the Decus compiler running on
RSTS/E in RT11 mode) is thus:
mcr lex -i grammar.lxi -o grammar.c ! Build analyser
cc grammar ! Compile the
as grammar ! grammar table
link out=in,grammar,c:lexlib,c:suport,c:clib/b:2000
4.0 The Lex Library
All programs using grammars generated by LEX must be linked
together with the LEX library. On Unix, this is '/lib/libl.a'
(or '-ll' on the cc command line) and on DEC operating systems,
C:LEXLIB (LB:[1,1]LEX.OLB for RSX). It contains routines which
are either essential or merely useful to users of LEX. The
essential routines include a routine to obtain a copy of the
current token, and a routine to switch to a different set of
scanning tables. Routines of the second, useful, class perform
functions which might well be written by the user himself, but
are there to save him the bother; including a routine to
process various forms of comments and a routine to transform
numbers written in arbitrary bases. Both sets of routines are
expected to grow as LEX sees use.
Those functions which produce diagnostics do so by calling
lexerror(), which is called as
lexerror(string, arg1, ..., argN)
and is expected to write its arguments (likely using the "remote
format" facility of the fprintf() function), followed by a
newline, on some output stream. A Lexerror() function is
included in the LEX library, but a user is free to include his
own. The routine in the LEX library is standard I/O specific.
NOTE
The VAX/VMS native C library does not support remote
formats. The Lexerror function in the LEX library
conditionally compiles to support a call to lexerror()
with only an error message string. Remote formats are
supported under Decus C. Learn to use them, they are
very nice!
4.0.1 Comment -- skip over a comment -
comment(delim)
char delim[];
Comment() may be called by a translation when the sequence of
characters which mark the start of a comment in the given syntax
has been recognised by LEX. It takes a string which gives the
sequence of characters which mark the end of a comment, and
skips over characters in the input stream until this sequence is
found. Newlines found while skipping characters cause the
external 'yyline' to be incremented; an unexpected end-of-file
produces a suitable diagnostic. Thus, 'comment("*/")' matches
C-style comments, and 'comment("\n")' matches end-of-line
comments of the kind used in assembly languages.
There are other methods of handling comments in LEX; the
comment() function is usually the best with regard to both space
and time.
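For example, a translation which discards C comments might be
written as follows; LEXSKIP (section 2.5.1) causes the analyser
to resume scanning for a real token:

        "/*"    {
                        comment("*/");
                        return(LEXSKIP);
                }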
4.0.2 Gettoken -- obtain a copy of token -
gettoken(buf, sizeof(buf))
char buf[];
Gettoken() takes the address of a character buffer, and its size
in bytes, and copies the token most recently matched by LEX into
the buffer. A null byte is added to mark the end of the token
in the buffer, but, as null bytes are legitimate characters to
LEX, the true length of the token is returned by gettoken().
For example, the following function calls lexlength() to obtain
the length of a token. It then calls the storage allocator to
allocate sufficient storage for the token and copies the token
into the allocated area.
        char *
        save()
        /*
         * Save current token, return a pointer to it
         */
        {
                register char *tbuffer;
                register int len;
                extern char *malloc();

                len = lexlength() + 1;
                if ((tbuffer = malloc(len)) == NULL)
                        error("No room for token");
                gettoken(tbuffer, len);
                return(tbuffer);
        }
4.0.3 Integ -- long integer, any base -
long
integ(nptr, base)
char *nptr;
int base;
Integ() converts the Ascii string at 'nptr' into a long integer,
which it returns. Conversion stops at the first non-digit,
where the digits are taken from the class "[0-9a-zA-Z]" as
limited by the given 'base'. Integ() does not understand signs,
nor are blanks or tabs allowed in the string.
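For example, a translation for octal constants might use integ()
as follows; T_NUMBER is an invented token number, and the buffer
size is arbitrary:

        0[0-7]* {
                        char nbuf[20];
                        extern long integ();

                        gettoken(nbuf, sizeof nbuf);
                        lexval = integ(nbuf, 8);
                        return(T_NUMBER);
                }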
4.0.4 Lexchar -- steal character --
lexchar()
Lexchar() returns the next character from the LEX input stream.
(This means that LEX will no longer see it.) LEX uses a
look-ahead buffer to handle complex languages, and this function
takes this into account.
4.0.5 Lexecho -- write token to a file (STDIO ONLY) -
lexecho(fp);
FILE *fp;
Lexecho() may be called by a LEX action routine to write the
current token to a specified file.
NOTE
Programs using analyzers built with LEX's "-s" switch
must supply their own lexecho() function if needed.
4.0.6 Lexgetc -- supply characters to yylex (STDIO ONLY) -
lexgetc()
Lexgetc() is called by the lexical analyser to obtain characters
from its input stream. The version in the library is dependent
on the standard I/O package, and is:
        FILE *lexin;            /* Declare iov address locally */

        lexgetc()
        {
                return(getc(lexin));
        }
If lexin is NULL when yylex() is entered, it will be assigned to
stdin. This is done by yylex() calling the function llstin(),
which is generated in the table file. Unless the "-s" switch is
given to LEX, the llstin() function assigns lexin to stdin if
lexin is NULL. If the "-s" switch was given, the llstin()
routine is a no-op. The user may provide his own version of
lexgetc() to pre-process the data to the lexical analyser. An
example of this is shown in the appendix.
NOTE
Programs using analyzers built with LEX's "-s" switch
must supply their own lexgetc() function, and "lexin"
has no meaning in this context.
4.0.7 Lexlength -- return length of a token. -
lexlength();
Lexlength() may be called by a LEX action routine to obtain the
length of the current token in bytes. An example of this is
shown in the description of gettoken().
4.0.8 Lexpeek -- examine character -
lexpeek()
Lexpeek() performs a function similar to that of Lexchar(), but
does not have the side-effect of removing the character from
LEX's view.
4.0.9 Lexswitch -- switch scanning tables -
struct lextab *
lexswitch(newtb)
struct lextab *newtb;
Lexswitch() is called to cause LEX to use a different scanning
table; it returns a pointer to the one previously in use. This
facility is useful if certain objects of the language (eg,
strings in C) have a fairly complicated structure of their own
which cannot be handled within the translation section of the
LEX description of the larger language.
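As a sketch, a grammar might hand strings over to a second table
built with '-t strtab'; both the table name and the saved
pointer are inventions of this example:

        %{
        extern struct lextab strtab;    /* Built with "-t strtab" */
        struct lextab *savedtab;        /* Table in use before switch */
        %}
        %%
        "\""    {
                        savedtab = lexswitch(&strtab);
                        return(LEXSKIP);
                }

An action in the strtab grammar would make the matching call
'lexswitch(savedtab)' upon recognising the closing quote.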
4.0.10 Llinit -- Reinitialize yylex() -
llinit()
Llinit() is a function which resets the state of yylex() to its
cold-start condition. Several of yylex()'s variables are
initialized at compile time, and must be reinitialized if it is
to be serially re-used. An example of this is where yylex() is
repeatedly called inside a loop which processes multiple input
files. Each time a new file is started, llinit() must be called
before the first call to yylex() for the new file.
4.0.11 Mapch -- Handle C escapes within strings (STDIO ONLY) -
int mapch(delim, esc)
char delim;
char esc;
Mapch() is a function which handles C "escape" characters such
as "\n" and "\nnn". It will scan off the entire escape sequence
and return the equivalent ASCII code as an integer. It is meant
for use with YACC while scanning quoted strings and character
constants.
If it encounters EOF while scanning, it calls lexerror() to
print an error message warning of "Unterminated string". If a
normal character is read, it returns the ASCII value. If
"delim" (usually " or ') is read, it returns EOF. If a newline
(ASCII linefeed) is read, it increments the global "yyline" and
calls itself recursively for the next line of input. It may use
the ungetc() function to back up in the input stream.
NOTE
This routine is very application-specific for use by LEX
and YACC when they are working together. You should
read the code in MAPCH.C before using this function.
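With that caution in mind, a character-constant translation
might use mapch() roughly as follows (T_CHARCON is an invented
token number):

        "'"     {
                        register int c;

                        if ((c = mapch('\'', '\\')) == EOF)
                                lexerror("Empty character constant");
                        else {
                                lexval = c;
                                if (mapch('\'', '\\') != EOF)
                                        lexerror("Character constant too long");
                        }
                        return(T_CHARCON);
                }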
4.0.12 Token -- get pointer to token -
char *
token(end_pointer)
char **end_pointer;
Token() locates the first byte of the current token and returns
its address. It takes an argument which is either NULL or a
pointer to a character pointer; if the latter, that pointer is
set to point to the byte after the last byte of the current
token. Token() is slightly faster, and more convenient than
gettoken() for those cases where the token is only one or two
bytes long.
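For example, single-character operators may be returned
directly, their Ascii values serving as token numbers in the
manner Yacc expects (section 3.1):

        "+" | "-" | "*" {
                        extern char *token();

                        return(*token(NULL));
                }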
5.0 Error Detection and Recovery
If a character is detected in the input stream which cannot be
added to the last-matched string, and which cannot start a
string, then that character is considered illegal by LEX. LEX
may be instructed to return a special 'error' token, or to write
a diagnostic with lexerror(). At present, the former is the
default action.
The token LEXERR is a special value which is recognised by Yacc,
and causes it to start its own error recovery. It is defined by
the header file lex.h for use by other programs.
Often, it makes more sense to simply type a suitable diagnostic,
and continue by ignoring the offending character. It is fairly
easy to cause LEX to do this, by including the auxiliary
definition:
illegal = [\0-\377];
which defines a character class "illegal" which is handled
specially by LEX. If the character that is causing the trouble
is a member of that character class (and in the example, all
characters are), then LEX will write a diagnostic, and ignore
it; otherwise, it will return the special token LEXERR.
More comprehensive techniques may be added as they become
apparent.
6.0 Ambiguity and Look-ahead
Many computer languages have ambiguous grammars in that an input
token may represent more than one logical entity. This section
discusses the way in which grammars built by LEX resolve
ambiguous input, as well as a way for the grammar to assign
unique meaning to a token by looking ahead in the input stream.
6.1 Resolving Ambiguities
A LEX program may be ambiguous, in the sense that a particular
input string or strings might be matched by the regular
expression of more than one translation. Consider,
[a-z] { putchar(*token(NULL)); }
aaa* { printf("abc"); }
in which the string 'aa' is matched by both regular expressions
(twice by the first, and once by the second). Also, the string
'aaaaaa' may be matched in many different ways. LEX has to
decide somehow which actions should be performed.
(Alternatively, it could produce a diagnostic, and give up. As
it happens, LEX never does this.)
Consider a second example,
letter = [a-z];
%%
A(letter)* { return(1); }
AB(letter)* { return(2); }
which attempts to distinguish sequences of letters that begin
with 'a' from similar sequences that begin with 'ab'. These two
examples illustrate two different kinds of ambiguity, and the
following indicates how LEX resolves these.
In the first example, it seems likely that the intent was to
have both 'aa' and 'aaaaaa' perform the second action, while all
single letters 'a' cause the first action to be performed. LEX
does this by ensuring that the longest possible part of the
input stream will be used to determine the match. Thus,
< { return(LESS); }
<= { return(LESSEQ); }
or
digit(digit)* { return(NUMBER); }
letter(letter|digit)* { return(NAME); }
would work as one might expect.
In the second example, the longest-string rule does not help. On the
string "abb9", either action could apply, and so another rule
must be followed. This states that if, after the longest-string
rule has been applied, there remains an ambiguity, then the
action which appears first in the LEX program file is to be
performed. As the second example is written, the second action
will never be performed. It should instead be written as:
letter = [a-z];
%%
AB(letter)* { return(1); }
A(letter)* { return(2); }
Ordered this way, the two rules together completely determine
the handling of any such string.
At present, LEX produces no diagnostic in either case; it
merely applies the rules and proceeds. In the case where
priority is given to the first-appearing rule, it might be a
good idea to produce a diagnostic.
6.2 Look-ahead
Some facility for looking ahead in the input stream is sometimes
required. (This facility might also be regarded as a way for
the programmer to more closely control LEX's ambiguity
resolution process.) For example, in C, a name followed by "("
is to be contextually declared as an external function if it is
otherwise undefined.
In Pascal, look-ahead is required to determine that
123..1234
is an integer 123, followed by the subrange symbol "..",
followed by the integer 1234, and not simply two real numbers
run together.
In both of these cases, the desire is to look ahead in the input
stream far enough to be able to make a decision, but without
losing tokens in the process.
A special form of regular expression is used to indicate
look-ahead:
re1 '/' re2 '{' action '}'
where 're1' and 're2' are regular expressions. The slash is
treated as concatenation for the purposes of matching incoming
characters; thus both 're1' and 're2' must match adjacently for
the action to be performed. 'Re1' indicates that part of the
input string which is the token to be returned, while 're2'
indicates the context. The characters matched by 're2' will be
re-read at the next call to yylex(), and broken into tokens.
Note that you cannot write:
name = re1 / re2;
The look-ahead operator must be part of the rule. It is not
valid in definitions.
In the first example, the look-ahead operator would be used as:
name / "(" {
if (name undefined)
declare name a global function;
}
name { /* usual processing for identifiers */
}
In the second example, the range construction would be parsed as
follows:
digit = [0-9];
int = digit(digit)*;
%%
int / ".." int { /* Start of a range */
".." int { /* End of a range */
Note that right-context is not sufficient to handle certain
types of ambiguity, as is found in several places in the Fortran
language. For example,
do i = 1 Is an assignment statement
do i = 1, 4 Is a DO statement
It is not sufficient to use right-context scanning to look for
the comma, as it may occur within a parenthesized
sub-expression:
do i = j(k,l) Is an assignment statement
In Fortran, similar problems exist for IF and FORMAT statements,
as well as counted (Hollerith) string constants. All of these
require a more powerful grammar than is possible with LEX
regular-expressions.
7.0 Multiple Scanning Tables; Processor Switching
Even a fairly simple syntax may be difficult, or impossible, to
describe and process with a single set of translations. An
example of this may be found in C, where strings, which are part
of the language, have quite a different structure, and in order
to process them, either a function must be called which reads
and parses the input stream for itself, or some mechanism within
LEX must be invoked to cause a (usually massive) change of
state.
LEX does provide such a facility, which is known, after AED, as
'processor switching'. Yylex() locates its tables through a
pointer; if one simply changes the pointer to point at a new
set of tables, one will have effected the required change of
state. The LEX library function lexswitch(), which is described
elsewhere in this guide, arranges to do this; it also returns
the old value of the pointer so that it may be restored by a
later call to lexswitch(). Thus, scanning environments may be
stacked, or not, as the user requires.
7.1 Creation of a Processor
It should be clear that if all the tables produced by LEX from a
user's translation file have the same name, someone (the loader)
is bound to object. Some method must be provided to change the
name of the table.
This is done by an option flag to the LEX command:
-t name
will cause the scanning table to be declared as
struct lextab name;
so that it may be passed to lexswitch():
lexswitch(&name);
LEX also writes the program file to the file "name.c" rather
than to "lextab.c."
NOTE
If you use the "easy" command line ("-e name") when
running LEX, the output file and table names will
correspond nicely. Re-read the section on operating LEX
for more details.
8.0 Conclusion
LEX seems to handle most lexical analysis tasks easily. Indeed,
LEX may be more generally used to write commands of a
text-processing nature; an example of such usage may be found
in an appendix. LEX programs are far easier to write than the
equivalent C programs, and generally consume less space
(although there is an initial overhead for the more general
table-interpreter program). The encoding suggested in [4]
achieves a reasonable compromise between table size, and
scanning speed. Certainly lexical analysers are less tedious
and time-consuming to write.
It is expected that most change in the future will be through
additions to the LEX library. The LEX language may change
slightly to accommodate common kinds of processing (eg, break
characters), or to extend its range of application. Neither
kind of change should affect existing LEX programs.
LEX produces tables and programs for the C language. The tables
are in a very simple (and stylised) format, and when LEX copies
the action routines or the program section, the code might as
well be Fortran for all it cares. One could write Unix filters
to translate the very simple C format tables into other
languages, allowing LEX to be used with a larger number of
languages, with little extra development cost. This seems a
likely future addition.
Because of the look-ahead necessary to implement the "longest
string match" rule, LEX is unsuitable for interactive programs
whose overall structure is:
        for (;;) {
                prompt_user();
                get_input();
                process();
                print_output();
        }
If these are rewritten as LEX-generated grammars, the user will
be confused by the fact that the second input datum must be
entered before the first is processed. It is possible to solve
this dilemma by rewriting function lexgetc() to return an
"end-of-line" character until processing is complete for that
line. An example is shown in the Appendix.
9.0 Acknowledgements
LEX is based on a processor of the same name at Bell
Laboratories, which also runs under Unix [3], and, more
distantly, on AED-0 [1]. This version of LEX was based on the
description and suggestions of [4], although the implementation
differs significantly in a number of ways.
10.0 References
1. Johnson, W.L., et al., "Automatic generation of
efficient lexical analysers using finite state
techniques", CACM Vol. 11, No. 12, pp. 805-813, 1968.
2. Johnson, S.C., "Yacc -- Yet Another Compiler-Compiler",
CSTR-32, Bell Telephone Laboratories, Murray Hill, New
Jersey, 1974.
3. Lesk, M.E., "Lex - a lexical analyser generator",
CSTR-39, Bell Telephone Laboratories, Murray Hill, New
Jersey, 1975.
4. Aho, A.V., Ullman, J.D., Principles of Compiler Design,
Addison-Wesley, Don Mills, Ontario, 1977.
APPENDIX A
LEX SOURCE GRAMMAR
The following is a grammar of LEX programs which generally
follows Backus-Naur conventions. In the rules, "||" stands for
alternation (choose one or the other). Other graphic text
stands for itself. Several grammar elements have special
meaning:
<anything> Any text not including the following
grammar element (either a literal or
end-of-file).
<nothing> Nothing -- used for optional rule
elements.
<name> A variable name.
<char_class> A character class specifier.
<string> A string (text enclosed in '"').
<EOF> The end of the input file.
This grammar was abstracted from the Yacc grammar used to
describe LEX.
program ::= aux_section trans_section
aux_section ::= auxiliaries %%
|| %%
auxiliaries ::= auxiliaries aux_def
|| aux_def
aux_def ::= name_def = reg_exp ;
|| %{ <anything> %}
name_def ::= <name>
reg_exp ::= <char_class>
|| <string>
|| <name>
|| reg_exp *
|| reg_exp | reg_exp
|| reg_exp reg_exp
|| ( reg_exp )
trans_section ::= translations
|| <nothing>
translations ::= translations translation
|| translation
translation ::= pattern action
|| %{ <anything> %}
|| %% <anything> <EOF>
pattern ::= reg_exp / reg_exp
|| reg_exp
APPENDIX B
SOME SMALL EXAMPLES
The following example illustrates the use of the look-ahead
operator, and various other of the nuances of using LEX.
B.1 A Complete Command
The C programming language has had two different ways of writing
its assignment operators. The original method was to write a
binary operator immediately following the ordinary assignment
operator, forming a compound operator. Thus 'a =+ b' caused the
value of 'a+b' to be assigned to 'a'. Similarly,
=- =/ =% =* =<< =>> =| =& =^
were written for the assignment operators corresponding to
subtraction, division, modulus, multiplication, left shift,
right shift, logical OR, logical AND, and exclusive OR. In the
current version of the language, the binary operator is written
to the left of the assignment operator, to remove potential
ambiguity.
The LEX program "ctoc" is a filter which converts programs
written in the older style into programs written in the newer
style. It uses the look-ahead operator, and the various
dis-ambiguating rules to ensure that sequences like
a==-1 a=++b
remain unchanged.
/*
* ctoc.lxi -- Convert old C operators to new C form
*
* Adapted from the example in C. Forsyth's LEX manual.
*
* NOTE:
* Forsyth's program put an entire comment into the token
* buffer. Either define a huge token buffer for my typical
* monster comments, or filter text within comments as if
* it were real C code. This is what I did. So =+ inside
* a comment will get changed to +=, etc. Note that you
* may use the comment() function in LEXLIB if you want the
* comments eaten. I wanted 'em in the output.
* by
* Bob Denny
* 31-Feb-81
*/
%{
char tbuf[80];                  /* Token buffer */

main()
{
        while (yylex())
                ;
}
%}
any = [\0-\177];
nesc = [^\\];
nescquote = [^\\"];
nescapost = [^\\'];
schar = "\\" any | nescquote;
cchar = "\\" any | nescapost;
string = '"' schar* '"';
charcon = "'" cchar* "'";
%%
"=" ( << | >> | "*" | + | - | "/" | "%" | "&" | "|" | "^" )
{
gettoken(tbuf, sizeof tbuf);
printf("%s=",tbuf+1);
}
/*
* The following will overflow the token buffer on any but a
* small comment:
*/
/*********
"/*" ([^*] | "*"[^/])* "*/"
{
lexecho(stdout);
}
**********/
[<=>!]"=" | "="[<>]
{
lexecho(stdout);
}
"=" / ( ++ | -- )
{
lexecho(stdout);
}
charcon
{
lexecho(stdout);
}
string
{
lexecho(stdout);
}
[\0-\377]
{
lexecho(stdout);
}
Assuming the Decus compiler running on RSTS/E in RT11 mode, the
above program would be built and executed as follows:
mcr lex -i ctoc.lxi -o ctoc.c
cc ctoc/v
as ctoc/d
link ctoc=ctoc,c:lexlib,c:suport,c:clib/b:2000
mcr ctoc <old.c >new.c
B.2 Interactive Lexical Analysis
The following program reads words from the terminal, counting
each as they are entered. The interaction with the operator is
"natural" in the sense that processing for one line is complete
before the next line is input. To implement this program, it
was necessary to include a special version of lexgetc() which
returns a zero (null) character if the current line has been
completely transmitted to the parser. Because the parser must still have
some look-ahead context, it will return the "end-of-line" token
twice at the beginning of processing. This required some
additional tests in the main program.
/*
* Count words -- interactively
*/
white = [\n\t ]; /* End of a word */
eol = [\0]; /* End of input line */
any = [!-~]; /* All printing char's */
illegal = [\0-\377]; /* Skip over junk */
%{
char line[133];
char *linep = line;
int is_eof = 0;
int wordct = 0;
#define T_EOL 1
main()
{
        register int i;

        while ((i = yylex()) != 0) {
                /*
                 * If the "end-of-line" token is returned
                 * AND we're really at the end of a line,
                 * read the next line.  Note that T_EOL is
                 * returned twice when the program starts
                 * because of the nature of the look-ahead
                 * algorithms.
                 */
                if (i == T_EOL && !is_eof
                  && *linep == 0) {
                        printf("* ");
                        fflush(stdout);
                        getline();
                }
        }
        printf("%d words\n", wordct);
}
%}
%%
any(any)* {
/*
* Write each word on a
* separate output line.
*/
lexecho(stdout);
printf("\n");
wordct++;
return(LEXSKIP);
}
eol {
return(T_EOL);
}
white(white)* {
return(LEXSKIP);
}
%%
getline()
/*
 * Read a line for lexgetc()
 */
{
        is_eof = (fgets(line, sizeof line, stdin) == NULL);
        linep = line;
}
lexgetc()
/*
* Homemade lexgetc -- return zero while at the
* end of an input line or EOF at end of file. If
* more on this line, return the next byte.
*/
{
return( (is_eof) ? EOF
: (*linep == 0) ? 0
: *linep++);
}